IRT. Sakineh Yeganegi
Item Response Theory

Classical Test Theory vs. Latent Trait Measurement Theory (IRT)

The focus in the preceding chapter was on a loosely defined body of knowledge that might be designated "classical" measurement theory. Classical theory is concerned with an approach to item and test analysis that relies heavily on the correlation coefficient as a statistical procedure. This is evident in the estimation of item discriminability and of test reliability and validity. Classical theory is further distinguishable by a mathematical interrelationship posited between true score, observed score, measurement error, reliability, and validity.

The point here is that classical measurement theory is not the only way to approach the analysis and interpretation of item and test data. Indeed, recent developments permit analysis from other perspectives as well. Item response theory, or latent trait theory as it has been variously termed, is the most notable complementary perspective. One author has likened the advent of this new approach in measurement to the advent of nuclear physics in the world of physics (Warm, 1978). This analogy suggests that there may be some profound and powerful benefits from an awareness and an application of the more recent approach.

Latent trait measurement or item response theory refers primarily, but not entirely, to three families of analytical procedures. These are identified as the one-parameter (or Rasch) model, the two-parameter model, and the three-parameter logistic model (Rasch, 1960; Lord and Novick, 1968; Warm, 1978; Hambleton and Swaminathan, 1985). What these models have in common is a systematic procedure for considering and quantifying the probability or improbability of individual item and person response patterns given the overall pattern of responses in a set of test data. They also offer new and improved ways for the estimation of item difficulty and person ability.
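As a rough sketch, the three logistic models can be written as a single item response function in which the simpler models hold some parameters fixed. The function name `response_probability` and the symbols `theta` (person ability), `b` (item difficulty), `a` (discrimination), and `c` (guessing) are illustrative labels rather than notation taken from this text, and the scaling constant that some authors include is left out.

```python
import math

def response_probability(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under a logistic IRT model.

    theta: person ability (logits)
    b:     item difficulty (logits)
    a:     discrimination (fixed at 1.0 in the one-parameter model)
    c:     guessing floor (fixed at 0.0 in the one- and two-parameter models)
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic

# One-parameter (Rasch) case: ability equal to difficulty gives an even chance.
p_rasch = response_probability(0.0, 0.0)

# Two-parameter case: a steeper item separates nearby abilities more sharply.
p_2pl = response_probability(1.0, 0.0, a=2.0)

# Three-parameter case: even a very weak examinee stays above the guessing floor.
p_3pl = response_probability(-3.0, 0.0, a=1.0, c=0.25)
```

Setting `a=1.0` and `c=0.0` recovers the one-parameter form, which is why the simpler models can be treated as special cases of the more complex ones in this sketch.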
As the names suggest, these models differ in complexity depending on the number of parameters they include. The first parameter is a scale of person ability and item difficulty; the second parameter is a continuous estimate of discriminability; the third parameter is an index of guessing. More will be said about the comparative differences of these three models, but it is appropriate at this point to consider what advantages may accrue through the use of these models.

Advantages Offered by Latent Trait Theory

Advantages available through the application of latent trait theory have been discussed in detail elsewhere (Henning, 1984). The advantages that are listed here do not apply equally to each of the three models named above. In general, these benefits are universally available through the Rasch one-parameter model, but may or may not be available in the other models. In this text priority of perspective has been given to the Rasch Model because it appears to be more readily applicable at the level of the small program or school. It is also easier for analysis, computation, and interpretation purposes, and its use is not dependent on the availability of mainframe computers. Subsequent space will be devoted to further comparison of these alternative models.

Sample-Free Item Calibration

In classical measurement, the estimated difficulty of any given item will vary widely with the average ability of the particular sample of examinees observed. When we report a difficulty or p-value we are always constrained to include a full description of the examinees, knowing that the difficulty index will vary for samples of different ability. It is not possible to compare item difficulties of different tests unless the same original sample is retested with both tests. Item analysis is sample-bound. In latent trait measurement we derive an item difficulty scale that is in one sense independent of ability differences of any particular sample of examinees drawn from the population of interest.
This is a powerful advantage. It is analogous to stating that we can now carry a uniform measuring stick to measure person height without the need to bring along the last group of persons measured to determine whether standards are being maintained. In classical measurement, items that are analyzed or normed with one group of examinees are forever suspect with any other group of examinees. Sample-free item calibration allows us to overcome this problem.

Test-Free Person Measurement

Ability measurement according to classical theory is dependent on the unique clustering of items in any given measurement instrument. It is not possible to administer one test of reading comprehension to person A and a different test of reading comprehension to person B and then make direct comparisons of ability unless the tests are pre-equated in a way that would involve the administration of both tests to the same large group of persons. In latent trait measurement it is possible to compare abilities of persons using different tests by referring to a small link of common items or common persons. Once items have been calibrated and joined to a common bank of items, any cluster of these items may be used to measure ability, and that ability will be located on the same scale as ability measured by any other cluster of these items. It would no longer be important what the exact number of items used on a given instrument might be, nor what the unique clustering of these items with other items would be. This is intuitively satisfying, much the same as the belief that measurement of length should be independent of whether a one-meter stick is used or a ten-meter tape.

Multiple Reliability Estimation

In classical measurement theory, one global estimate of reliability is obtained by any appropriate method for any given test. While this is a useful procedure, it is not altogether satisfactory.
Consider the fact that measurement of ability tends to be more reliable near the mean of the scoring distribution than at either end. This suggests that ability estimation varies in accuracy or reliability according to position along the scoring continuum. One global estimate of reliability should not be applied uniformly in evaluating the accuracy of scores for every individual examined. In latent trait measurement the standard error of measurement is determined for every possible point along the scoring continuum. This standard error measure may be derived for estimates of both person ability and item difficulty. Thus with latent trait theory, reliability estimation goes beyond a global estimate for a given test, to a confidence estimate associated with every possible person and item score on that test.

Identification of Guessers and Other Deviant Respondents

Earlier in this same text we encountered the notion of correction for guessing. By use of the formulas provided it is possible to partially compensate for the guessing inherent in multiple-choice testing. These formulas penalize incorrect responses over and above the penalty for not attempting to respond. In some situations this procedure will improve test reliability. But because every wrong response is treated the same as every other wrong response, no attempt is made to differentiate between wrong responses that are truly blind guesses and wrong responses that are considered choices. It is not possible with classical measurement theory, therefore, to quantify the amount of guessing occurring for any given individual.

Latent trait theory allows us to quantify the improbability of any response given knowledge of the difficulty of the item and of the ability of the person responding. If a person of low ability repeatedly passes items of high calibrated difficulty, it may be inferred that guessing is taking place. The lower the person's ability is with regard to the difficulty of the item passed, the more improbable the successful response.
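The improbability described above can be illustrated with a small sketch under the one-parameter (Rasch) model. The ability and difficulty values are invented for illustration, the function name is hypothetical, and multiplying the probabilities together assumes that the responses are independent once ability is taken into account.

```python
import math

def rasch_probability(ability, difficulty):
    """Rasch-model probability that a person passes an item (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical low-ability examinee and three hard items, all on one logit scale.
ability = -1.0
hard_items = [1.0, 1.5, 2.0]

# Probability of passing each hard item: it falls as the gap
# between item difficulty and person ability widens.
pass_probabilities = [rasch_probability(ability, d) for d in hard_items]

# Probability of passing all three, assuming independent responses.
joint_probability = math.prod(pass_probabilities)

for d, p in zip(hard_items, pass_probabilities):
    print(f"difficulty {d:+.1f}: P(pass) = {p:.3f}")
print(f"P(passing all three) = {joint_probability:.5f}")
```

A very small joint probability does not prove that guessing occurred; it only flags the response pattern as unexpected under the model.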
Usually, guessing is noticeable when an examinee passes items that have known difficulty greater than his or her known ability. In the three-parameter latent trait model, this kind of guessing is quantified directly for each examinee as one of the parameters of the model. In the one- and two-parameter models, guessing is quantified along with other error sources as the index of person fit to the model. This index of fit, referred to as the person fit statistic, lumps together guessing with all other sources of improbable response patterns. Persons may be ranked in terms of the degree of their misfit to the model. If a person is identified as misfitting the expectations of the model, then the test may not be valid for that person. Such persons are not identified as showing invalid responses under classical measurement theory.

Misfitting the model might occur, by way of example, when there is much guessing taking place, when an examinee is not cooperating during the examination, when instructions are not clear, when the examinee copies the answers of someone with greater ability during some portion of the test, when there is some perceptual handicap in the examinee (e.g., hearing or visual deficiencies), and so on. If possible, it is advantageous to interview misfitting persons to determine some explanation for their improbable responses. Any test scores for such persons should be interpreted with caution.

Potential Ease of Administration and Scoring

Once items have been calibrated for difficulty it is possible to select items to match the known ability range of the examinees. Since only those items are used that are necessary to measure the ability of the examinees, many redundant or superfluous items can be deleted from the test. The result is a test that can be administered in less time, with less fatigue or boredom for the examinees, and with less expense for the examiners. And this can be accomplished without the sacrifice of test reliability and validity.
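Matching items to a known ability range might be sketched as a simple filter over calibrated difficulties. The item identifiers, the difficulty values, the `margin` parameter, and the function name `select_matched_items` are all hypothetical; an operational program would likely weigh content coverage and other constraints as well.

```python
def select_matched_items(bank, ability_low, ability_high, margin=0.5):
    """Return ids of items whose calibrated difficulty falls inside the
    examinees' ability range, widened by `margin` logits on each side."""
    lower = ability_low - margin
    upper = ability_high + margin
    return sorted(
        item_id
        for item_id, difficulty in bank.items()
        if lower <= difficulty <= upper
    )

# Hypothetical calibrated difficulties (logits) for a small set of items.
bank = {
    "item01": -2.4,
    "item02": -0.8,
    "item03": -0.1,
    "item04": 0.6,
    "item05": 1.3,
    "item06": 2.9,
}

# Examinees known to fall between -0.5 and 1.0 logits.
short_form = select_matched_items(bank, ability_low=-0.5, ability_high=1.0)
print(short_form)
```

Items far below or far above the group's ability are left out, which is the source of the shorter testing time described above.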
The scoring process can also be made more efficient. In classical measurement, first, the total raw score is computed as the sum of the correct items. Then, if indicated, some correction-for-guessing adjustment is applied. Finally, a scaled score conversion is made to enable reference across forms of the test. In latent trait measurement, since precalibrated item difficulties are used to define the variable, person ability inferences may be drawn directly from performance on any item task. This means that person ability can be determined without the need to compute a total score on the test.

Economy of Items

If only those items are used that approximate in difficulty the known ability region of the examinees, then fewer items will be required. If ability measurement can cease whenever examinee ability is estimated with predetermined levels of accuracy, then fewer items will be needed for any given individual. If items are precalibrated, banked, and randomly summoned for any given measurement task, then there is less risk of a security breakdown that would disqualify large numbers of items for future use. All of these advantages add up to greater economy of items over time and use.

Reconciliation of Norm-Referenced and Criterion-Referenced Testing

The distinctions between norm-referenced and criterion-referenced tests were discussed in chapter one. It was noted that the former approach references individual performance to that of a group or norm, while the latter procedure references individual performance to an objective-based standard of content knowledge or skill. In latent trait measurement, all of the benefits of both approaches can be reaped in one and the same test. Since both person ability and item or task difficulty are positioned along the same latent continuum, it is possible to draw inferences from examinee performance that are referenced to the performances of other individuals or to the standards imposed by other tasks.
This is a powerful advantage over classical test theory, which was not able to reconcile these two approaches to measurement.

Test Equating Facility

In chapter six the notion of parallel forms of tests was introduced. There it was asserted that tests could be said to be statistically parallel or equivalent if they could demonstrate equal means, equal variances, and equal covariances. Satisfying these rigorous criteria is sufficiently difficult that most test developers settle for equated, as opposed to equivalent, forms of tests. According to classical measurement theory, equated tests require that all test forms to be equated be administered in their entirety to the same large sample of examinees. Then, assuming that the test forms are highly intercorrelated, some procedure must be adopted (e.g., regression or equipercentile methods) for the generation of a common set of scaled scores that can serve as a translation of comparative performance on the various tests. This is usually an expensive and time-consuming activity. It is often difficult to find a sufficiently large sample of persons who have time to participate in the administration of two or more tests within a short period of time. Even if such persons are found, the fatigue associated with the administration of so many test items in such a short time can often invalidate the overall results.

Latent trait measurement theory can greatly facilitate equating of tests. By this approach, it is no longer necessary to administer all forms of tests in their entirety to the original, large sample of examinees. By means of a group of common linking items (perhaps only ten or more), scores on one test form can be equated with those on other forms, even though the other forms are administered to different samples of individuals drawn from the same general population at different times.
This powerful advantage greatly reduces the testing time of any individual examinee, while it increases the likelihood that valid participation will be elicited from appropriate samples of examinees.

Test Tailoring Facility

Once items have been calibrated on a latent trait continuum, it is possible to use those items in the construction of tests appropriate to specific measurement needs. Consider the following figure as an illustration of how this might be possible. Notice that in this example we are considering cut-off scores for entrance into and exit from an English language institute prior to full university admission.

FIGURE 1 An illustration of the advantages of test tailoring
(a) Information levels for cut-off scores with an untailored standardized test
(b) Information levels for cut-off scores with a tailored test

For purposes of illustration, a score of 67 is said to qualify a foreign student for entrance into the intensive English program of the institute, but not for full admission into university classes. Below 67, applicants are rejected from both the university and the language institute. A score of 82 is accepted as grounds for full admission into the university without remedial intensive English study.

In Figure 1(a) we see that a standardized proficiency exam has been administered with a corresponding normal distribution of the test information function (see page 54 for a discussion of information function). Maximum information about the examinees is available at the mean of the scoring distribution. The average respondent would have a 50 percent probability of success with items that have difficulty estimates falling near the mean of the scoring distribution. Comparatively less information is available at the critical decision cut-off points. We can see from this example that decision accuracy resulting from the application of this test will be low, especially at the 82-score university admission criterion.
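The way information thins out away from the center of a test can be sketched numerically. The example below assumes the one-parameter (Rasch) model, under which an item contributes p(1 - p) to the test information at a given ability, and the standard error of an ability estimate is approximately one over the square root of that information. The item difficulties and the logit positions standing in for the mean and the two cut-off scores are invented; the text's scores of 67 and 82 are on a different scale and are not converted here.

```python
import math

def rasch_probability(ability, difficulty):
    """Rasch-model probability of a correct response (values in logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def information_at(ability, difficulties):
    """Test information at one ability level: the sum of p * (1 - p) over items."""
    total = 0.0
    for difficulty in difficulties:
        p = rasch_probability(ability, difficulty)
        total += p * (1.0 - p)
    return total

def standard_error(ability, difficulties):
    """Approximate standard error of the ability estimate at this point."""
    return 1.0 / math.sqrt(information_at(ability, difficulties))

# Hypothetical untailored test: item difficulties clustered around the mean.
centered_test = [-1.0, -0.5, -0.25, 0.0, 0.0, 0.25, 0.5, 1.0]

# Hypothetical logit positions for the mean and for two decision points.
mean_point, lower_cut, upper_cut = 0.0, 1.5, 2.5

for label, point in [("mean", mean_point), ("lower cut-off", lower_cut),
                     ("upper cut-off", upper_cut)]:
    info = information_at(point, centered_test)
    se = standard_error(point, centered_test)
    print(f"{label:>13}: information = {info:.2f}, standard error = {se:.2f}")
```

The same two functions could be applied to any other list of difficulties to see where a different selection of items would concentrate its information.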
In Figure 1(b) we see the information function of a test that has been tailored for this particular decision-making situation. This test has been purposely loaded with items with difficulty calibrations falling near the 67- and 82-score decision points. Notice that this test has the same total number of items as the test of Figure 1(a), and probably the same overall information is made available for the same time of test administration. The important point here, and the purpose of this illustration, is that the information in the tailored test has been concentrated at the decision-making cut-off points. The tailored test will provide much greater decision accuracy than the standardized test. Fewer students will be wrongly admitted to or wrongly rejected from university study or intensive English study by use of the tailored test. This advantage may easily be gained through use of latent trait measurement theory.

Item Banking Facility

Once items have been calibrated according to latent trait or item response theory, they can be stored in an item bank according to a common metric of difficulty. This is generally true regardless of whether the subsequent person samples tested are equal in ability or size. The item bank becomes more than just a catalog of used items with descriptions of their successes and failures. It becomes an ever-expanding test which spans the latent ability continuum beyond the measurement needs of any one individual, but which may be accessed to gather items appropriate to any group of persons from the same general population with respect to the ability measured. Latent trait theory facilitates item banking by allowing all of the items to be calibrated and positioned on the same latent continuum by means of a common metric. Also, it permits additional items to be added subsequently without the need to locate and retest the original sample of examinees.
Furthermore, an item bank so maintained permits the construction of tests of known reliability and validity based on appropriate selection of item subsets from the bank without further need for trial in the field.

The Study of Item and Test Bias

Prior to the advent of latent trait testing theory it was uncommon to find bias studies that attempted to quantify the amount and direction of bias for any given item or person. Bias was usually studied with regard to a test as an entire unit or with respect to a group or class of persons. One exception to this assertion concerns the practice of relying on a panel of experts to rate individual items as biased for or against some group of persons. Unfortunately, expert opinion has not consistently been successful in identifying biased items. Latent trait methodology has the advantage that it permits the quantification of the magnitude and direction of bias for individual items or persons. This enables the correction of test bias whether through removal, revision, or counterbalancing of biased items. Thus test bias may be neutralized not only through the removal of biased items, but also through the purposeful inclusion of items biased in the opposite direction.

Elimination of Boundary Effects in Program Evaluation

One of the persistent problems associated with the analysis of learning gain in any language teaching program is the problem of instrument boundary effect. As students acquire or achieve greater proficiency or skill, the group mean performance score increases along the effective range of the test. When, in classical measurement, the group mean approaches the highest or lowest extremes of the effective range, the score distribution becomes skewed. When students become capable of scoring beyond the highest possible score on the test, a ceiling effect becomes apparent. The net result of these phenomena is that measurement of group learning gains over time becomes obscured.
The instrumentation is no longer capable of accurately registering the mean learning gains that have taken place.

With latent trait measurement boundary effects are removed. A logarithmic transformation is used to change the raw obtained scores to interval scores. The interval scores are adjusted to hold sample size, ability spread, test size, and test variance constant. Ability measurement may then be articulated from one test to another or from one sample of examinees to another, because the same scale is in use in all cases and because both item difficulty and person ability are calibrated on this same scale.

If any person in the test sample gets zero items correct or manages to get all items correct, no estimate is made of that person's ability. This is because it is acknowledged from the start that such a person may be almost infinitely weaker or more capable than the test is able to measure. Similarly, if any item is missed by all persons or is answered correctly by all persons, no estimate is made of that item's difficulty. There is no way to gauge the item difficulty accurately since both success and failure were not experienced with the item.

In the case of persons who fail to experience both success and failure with the items of a test, a search is made for items of greater or lesser difficulty as required so that ability estimation may occur. For the calibration of item difficulty when items are uniformly passed or failed by all persons in the sample, a search is made for persons of lesser or greater ability until at least one person passes and one person fails each item. When program evaluation is made only with persons and items that exhibit both success and failure in this manner, and when sample size, dispersion, and central tendency are transformed to articulate to the same interval scale, then boundary effects cease to exist.
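The first step of the logarithmic transformation mentioned above can be sketched as a log-odds conversion of the raw score. This is only an illustration of that initial step: the function name `initial_ability_logit` is hypothetical, and the further adjustments for ability spread, test length, and test variance described in the text are not implemented here.

```python
import math

def initial_ability_logit(raw_score, n_items):
    """Convert a raw score to a log-odds (logit) value.

    Returns None for zero or perfect scores, because the log-odds is
    undefined there and no finite ability estimate can be made.
    """
    if raw_score <= 0 or raw_score >= n_items:
        return None
    return math.log(raw_score / (n_items - raw_score))

n_items = 20
for raw in (0, 5, 10, 15, 20):
    logit = initial_ability_logit(raw, n_items)
    shown = "no estimate" if logit is None else f"{logit:+.2f}"
    print(f"raw score {raw:2d} of {n_items}: {shown}")
```

Because the transformation stretches the scale toward both extremes and refuses to estimate at the boundaries, scores near the floor or ceiling are not compressed in the way raw totals are.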
Item and Person Fit Validity Measures

In classical measurement theory, criterion-related or construct validity of an item may be ascertained through correlational methods, given an appropriate criterion. The actual response validity associated with a person or an item is not estimated per se. Latent trait measurement provides this valuable additional information. Due to the probabilistic nature of latent trait models, it is possible to quantify for any person or item the magnitude of the departure of the given pattern of responses from the pattern predicted by the model. This departure or unlikelihood statement is a kind of response validity or model fit validity estimate that is available for both persons and items without the need to go outside of the given sample of persons and items in search of a criterion.

Score Reporting Facility

Score reporting might be facilitated through latent trait measurement